Organization Influence on Loop Blocking
Author: François Bodin
Abstract
Performance tuning on today's computers has become very complex. One factor in this complexity is the use of memory hierarchies, particularly cache memories. Code transformations such as loop blocking are used to improve temporal locality in numerical codes. Unfortunately, the behavior of direct-mapped and set-associative caches is very sensitive to parameters such as the relative placement of arrays, which is determined by their leading dimensions. This sometimes leads to unpredictable and catastrophic performance, even on blocked numerical kernels. Most users are not experts in cache organization and cannot be aware of such phenomena. In this paper, we show that the recently proposed 4-way skewed-associative cache is quite insensitive to array placement in memory, and therefore gives the user stable and predictable behavior on both basic and blocked algorithms. The average behavior of the 4-way skewed-associative cache is also better than that of the 4-way set-associative cache on all algorithm versions. With the 4-way skewed-associative cache, copying is never necessary to obtain predictable performance, whereas it is generally the only means of obtaining such performance on set-associative and direct-mapped caches. Furthermore, on blocked algorithms, a large fraction of the cache space in a 4-way skewed-associative cache may be used for blocking the loop, which limits the overhead due to blocking.
French title: De l'influence de l'organisation des caches sur le blocage des boucles. Résumé (translated from the French): The performance of today's computers has become very difficult to exploit and to predict. One of the factors that makes this prediction extremely complex is the use of memory hierarchies, and in particular of cache memories. Program transformations such as loop blocking can be used to improve the locality of applications in numerical codes. Unfortunately, the behavior of direct-mapped caches and set-associative caches is very sensitive to parameters such as the placement of arrays in memory; this sometimes causes unpredictable and catastrophic performance drops, even on blocked codes. Most users cannot be aware of such phenomena. In this article, we show that the 4-way skewed-associative cache that we recently proposed is nearly …
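The sensitivity to leading dimensions described in the abstract can be illustrated with a small address-mapping sketch. This is not the paper's experiment. The function name `tile_footprint` and every parameter (an 8 KB direct-mapped cache, 32-byte lines, 8-byte elements, column-major storage, a 16 x 16 tile) are assumptions chosen only for illustration. The sketch counts how many distinct memory lines a tile occupies and how many distinct cache sets those lines map to; when the second number is much smaller than the first, the tile's own lines evict each other.

```python
def tile_footprint(leading_dim, block, cache_bytes=8192, line_bytes=32, elem_bytes=8):
    """Return (memory lines touched, cache sets used) for a block x block tile.

    Assumes a column-major array whose tile starts at address 0 and a
    direct-mapped cache (one line per set). All sizes are hypothetical.
    """
    num_sets = cache_bytes // line_bytes
    lines = set()
    for j in range(block):          # column index within the tile
        for i in range(block):      # row index within the tile
            addr = (j * leading_dim + i) * elem_bytes
            lines.add(addr // line_bytes)
    # In a direct-mapped cache, a line's set is its line number modulo the set count.
    sets = {line % num_sets for line in lines}
    return len(lines), len(sets)


if __name__ == "__main__":
    for ld in (1000, 1024):
        n_lines, n_sets = tile_footprint(ld, 16)
        print(f"leading dimension {ld}: {n_lines} lines map to {n_sets} sets")
```

Under these assumptions, a leading dimension of 1024 elements makes every tile column start at a multiple of the cache size, so all 64 lines of the tile compete for 4 sets. A leading dimension of 1000 spreads the same 64 lines over 64 different sets. A set-associative or skewed-associative cache would need a different mapping function than the modulo used here.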
Similar resources
Skewed Associativity Improves Program Performance and Enhances Predictability
Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE. Abstract: Performance tuning becomes harder as computer technology advanc...
Parallel Sparse Matrix by Vector Multiplication using a Shared Virtual Memory Environment
Many iterative schemes in scientific applications require the multiplication of a sparse matrix by a vector. This kernel has been mainly studied on vector processors and shared-memory parallel computers. In this paper, we address the implementation issues when using a shared virtual memory system on a distributed memory parallel computer. We study in detail the impact of loop distribution schem...
A Quantitative Algorithm for Data Locality Optimization
In this paper, we consider the problem of optimizing register allocation and cache behavior for loop array references. We exploit techniques developed initially for data locality estimation and improvement in the framework of cache or local memories. First we review the concept of "reference window" that serves as our basic tool for both data locality evaluation and management. Then we study ho...
A Machine Learning Approach to Automatic Production of Compiler Heuristics
Achieving high performance on modern processors heavily relies on the compiler optimizations to exploit the microprocessor architecture. The efficiency of optimization directly depends on the compiler heuristics. These heuristics must be target-specific and each new processor generation requires heuristics reengineering. In this paper, we address the automatic generation of optimization heurist...